1. INTRODUCTION


In the practical applications of pattern recognition (such as in the process- 
ing of remotely sensed imagery data), obtaining labels is a difficult problem. 
Acquiring labels is expensive, and very often these labels are imperfect. 

Several scientists have investigated the problem of pattern recognition with
imperfectly labeled patterns (refs. 1-7). Duda and Singleton (ref. 1) showed
that, for orthogonal pattern vectors, the average weight vector of a threshold
logic unit converges to a solution weight vector for the correctly labeled
pattern set. Kashyap (ref. 2) proposed an iterative training procedure for a
two-class case. Shanmugam and Breipohl (ref. 3) developed an error-correcting
procedure for disjoint densities using Parzen estimators. Chittineni
(refs. 4-7) investigated the problem of learning with imperfectly labeled
patterns and studied the applicability of probabilistic distance measures for
feature selection with imperfectly labeled patterns. Most of these proposed
schemes require knowledge of the probabilities of label imperfections, which
usually are not available.

Several authors considered the problem of estimating recognition system per- 
formance (refs. 8-13). Highleyman (ref. 8) investigated the problem of estimat- 
ing the probability of error of a given classifier both for known and unknown 
a priori probabilities. Fukunaga and Kessell (ref. 9) examined the problem
of estimating the probability of error from unclassified samples. Havens et al.
(ref. 10) reported the experimental results of estimating the probability of 
error from unclassified samples using remotely sensed agricultural data. 

Chow (ref. 11) established a relationship between error and rejection rates 
which is useful in estimating the probability of error from unclassified 
samples. 

In practice, the situation often arises in which a set of imperfectly labeled
test patterns and a set of unlabeled patterns are available. (For example,
in remote sensing, a set of labeled patterns called type 2 dots and a set of
unlabeled patterns are usually available.) This paper presents the problem of
estimating recognition system performance and label imperfections as maximum
likelihood estimates from the classifier decisions of labeled and unlabeled
patterns. The estimated probabilities of label imperfections are then
used in developing schemes for the identification of mislabeled patterns.

The paper is organized in the following manner. 

Assuming no imperfections in the labels, expressions are derived for the
maximum likelihood estimates of probability of error, probability of correct
classification, and a priori probabilities (section 2); also, in this section,
expressions are derived for the asymptotic variances of probability of correct 
classification and a priori probabilities. In section 3, imperfections in the 
labels are introduced, models for the label imperfections and probabilities 
of errors are developed, and the simulation results from the processing of 
remotely sensed data are presented. Methods of identifying mislabeled pat- 
terns for both two-class and multiclass cases are reported in section 4, and 
the results of their applications in processing remotely sensed data are 
described. Conclusions are presented in section 5. 
2. MAXIMUM LIKELIHOOD ESTIMATION OF PROBABILITY OF ERROR,
PROBABILITY OF CORRECT CLASSIFICATION, AND
A PRIORI PROBABILITIES

In this section, expressions are derived for the maximum likelihood estimates 
of probability of error, probability of correct classification, and propor- 
tions. Also, expressions for the asymptotic variance of probability of cor- 
rect classification and proportion estimates are derived. It is assumed that 
the classifier is designed and the classifier classifications of a set of 
labeled and unlabeled patterns are obtained. [In a situation involving remote
sensing, the labeled patterns are the test set or type 2 dots and the unlabeled 
patterns are the spectral values of the picture elements (pixels) for which no 
labels are available.] In this section, the labels of the test patterns are 
assumed perfect; in section 3, the labels are assumed to be imperfect. The 
classifier classifications of the labeled and unlabeled sets are illustrated 
in table 2-1. 

Let $\omega$ be the given label and $\omega_c$ be the classifier label. Let
$\lambda_{ij} = P(\omega = i \mid \omega_c = j)$ be the probability that the true
label is $i$, given that the classifier label is $j$. Let
$p_{ij} = P(\omega = i, \omega_c = j)$ be the probability that the true label of
the pattern is $i$ and the classifier label is $j$. Let
$P_c(i) = P(\omega_c = i)$ be the probability that the classifier classifies a
pattern into class $i$ and $P_i = P(\omega = i)$ be the a priori probability of
class $i$. Then we obtain

$$
\begin{aligned}
p_{ij} &= P(\omega = i, \omega_c = j) \\
       &= P(\omega_c = j)\,P(\omega = i \mid \omega_c = j) \\
       &= P_c(j)\,\lambda_{ij}
\end{aligned}
\tag{2-1}
$$
TABLE 2-1.- CLASSIFICATIONS OF LABELED AND UNLABELED SETS

(a) Confusion matrix of labeled test set

                          Classifier label             Number belonging
True label            1       2      ...     M         to each class

    1                m_11    m_12    ...    m_1M           m_1.
    2                m_21    m_22    ...    m_2M           m_2.
    .                 .       .              .              .
    .                 .       .              .              .
    M                m_M1    m_M2    ...    m_MM           m_M.

Number classified
into each class      m_.1    m_.2    ...    m_.M         m_.. = m

(b) Matrix of classifications of unlabeled set

Classifier label      1       2      ...     M
Number of patterns   x_1     x_2     ...    x_M

where

$m_{ij}$ = number of labeled patterns for which the true or given label is $i$
and the classifier label is $j$

$M$ = number of classes

$m_{i\cdot} = \sum_{j=1}^{M} m_{ij}$

$m_{\cdot j} = \sum_{i=1}^{M} m_{ij}$

$m = m_{\cdot\cdot} = \sum_{i=1}^{M} \sum_{j=1}^{M} m_{ij}$, the total number of labeled patterns

$x_j$ = number of unlabeled patterns for which the classifier label is $j$
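Since each classification is independent, the likelihood function of the
observed $m_{ij}$'s and $x_j$'s can be written as

$$
\begin{aligned}
L &= C \prod_{i=1}^{M} \prod_{j=1}^{M} (p_{ij})^{m_{ij}} \prod_{j=1}^{M} [P_c(j)]^{x_j} \\
  &= C \prod_{i=1}^{M} \prod_{j=1}^{M} [P_c(j)\lambda_{ij}]^{m_{ij}} \prod_{j=1}^{M} [P_c(j)]^{x_j} \\
  &= C \prod_{i=1}^{M} \prod_{j=1}^{M} (\lambda_{ij})^{m_{ij}} \prod_{j=1}^{M} [P_c(j)]^{m_{\cdot j} + x_j}
\end{aligned}
\tag{2-2}
$$

where $C$ is a constant. The constraints on $\lambda_{ij}$ and $P_c(j)$ are

$$
\sum_{i=1}^{M} \lambda_{ij} = 1 \; ; \; j = 1, 2, \ldots, M
\qquad \text{and} \qquad
\sum_{j=1}^{M} P_c(j) = 1
\tag{2-3}
$$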


The objective is to find the values for $\lambda_{ij}$ and $P_c(j)$ which
maximize $L$, subject to the constraints of equation (2-3). Since the logarithm
is a monotonic function of its argument, taking the logarithm of $L$ and
introducing Lagrangian multipliers yields

$$
\begin{aligned}
L' = \log C &+ \sum_{i=1}^{M} \sum_{j=1}^{M} m_{ij} \log(\lambda_{ij})
  + \sum_{j=1}^{M} (x_j + m_{\cdot j}) \log[P_c(j)] \\
  &+ \sum_{j=1}^{M} r_j \left( \sum_{i=1}^{M} \lambda_{ij} - 1 \right)
  + s \left[ \sum_{j=1}^{M} P_c(j) - 1 \right]
\end{aligned}
\tag{2-4}
$$

where $r_j$ ($j = 1, 2, \ldots, M$) and $s$ are Lagrangian multipliers.
Differentiating $L'$ with respect to $P_c(j)$ and $s$, equating the resulting
expressions to zero, and solving for $P_c(j)$ results in

$$
\hat{P}_c(j) = \frac{m_{\cdot j} + x_j}{\sum_{\ell=1}^{M} (m_{\cdot \ell} + x_\ell)}
\tag{2-5}
$$
Similarly, the maximum likelihood estimate of $\lambda_{ij}$ can be obtained as

$$
\hat{\lambda}_{ij} = \frac{m_{ij}}{m_{\cdot j}}
\tag{2-6}
$$


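From the invariance property of the maximum likelihood estimators, the
maximum likelihood estimate $\hat{P}_{cc}$ for the probability of correct
classification $P_{cc}$ can be obtained from the expression

$$
\begin{aligned}
P_{cc} &= \sum_{i=1}^{M} P(\omega = i, \omega_c = i) \\
       &= \sum_{i=1}^{M} P(\omega_c = i)\,P(\omega = i \mid \omega_c = i) \\
       &= \sum_{i=1}^{M} P_c(i)\,\lambda_{ii}
\end{aligned}
\tag{2-7}
$$

Using equations (2-5) and (2-6) in equation (2-7) yields

$$
\hat{P}_{cc} = \frac{\sum_{i=1}^{M} \dfrac{m_{ii}}{m_{\cdot i}} (m_{\cdot i} + x_i)}
                    {\sum_{\ell=1}^{M} (m_{\cdot \ell} + x_\ell)}
\tag{2-8}
$$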


An intuitive justification for $\hat{P}_{cc}$ may be given as follows. The ratio
$(m_{ii}/m_{\cdot i})$ gives the proportion of the patterns truly belonging to
class $i$ among the patterns classified into class $i$. Multiplying this ratio
by $(m_{\cdot i} + x_i)$ and summing from 1 to $M$ gives an estimate for the
number of correctly classified patterns among all classified patterns. This
number is then divided by the total number of patterns to give $\hat{P}_{cc}$.
An estimate $\hat{P}_i$ for the proportion $P_i$ may be obtained as follows.

$$
\begin{aligned}
P_i &= P(\omega = i) \\
    &= \sum_{j=1}^{M} P(\omega = i, \omega_c = j) \\
    &= \sum_{j=1}^{M} P(\omega_c = j)\,P(\omega = i \mid \omega_c = j) \\
    &= \sum_{j=1}^{M} P_c(j)\,\lambda_{ij}
\end{aligned}
\tag{2-9}
$$

From equations (2-5), (2-6), and (2-9), the following is obtained.

$$
\hat{P}_i = \frac{\sum_{j=1}^{M} \dfrac{m_{ij}}{m_{\cdot j}} (m_{\cdot j} + x_j)}
                 {\sum_{\ell=1}^{M} (m_{\cdot \ell} + x_\ell)}
\tag{2-10}
$$


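Different probabilities of error can be written as

$$
P(\omega_c = j \mid \omega = i) =
\frac{P(\omega_c = j)\,P(\omega = i \mid \omega_c = j)}{P(\omega = i)}
\tag{2-11}
$$

Using equations (2-5), (2-6), and (2-10) in equation (2-11) yields the
maximum likelihood estimates $\hat{P}(\omega_c = j \mid \omega = i)$ for the
different probabilities of error.

$$
\hat{P}(\omega_c = j \mid \omega = i) =
\frac{\dfrac{m_{ij}}{m_{\cdot j}} (m_{\cdot j} + x_j)}
     {\sum_{\ell=1}^{M} \dfrac{m_{i\ell}}{m_{\cdot \ell}} (m_{\cdot \ell} + x_\ell)}
\tag{2-12}
$$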


The estimate of equation (2-12) can be interpreted as follows. It is the 
ratio of the number of patterns that truly belong to class i but were classi- 
fied into class j to the total number of patterns that truly belong to class i 
from the patterns classified into all classes. 
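
The estimates of this section can be illustrated with a short numerical sketch. The following Python fragment is only an illustration, not part of the original derivation: the function name `ml_estimates`, the argument names `m` and `x`, and the sample counts are assumptions, and every classifier class is assumed to receive at least one labeled pattern so that no column total is zero.

```python
def ml_estimates(m, x):
    """Sketch of the estimates of equations (2-5), (2-6), (2-8), (2-10), (2-12).

    m[i][j]: number of labeled patterns with true label i and classifier label j.
    x[j]:    number of unlabeled patterns with classifier label j.
    Assumes every column total of m is positive.
    """
    M = len(m)
    col = [sum(m[i][j] for i in range(M)) for j in range(M)]  # m_.j
    total = sum(col) + sum(x)                                 # all classified patterns
    p_c = [(col[j] + x[j]) / total for j in range(M)]         # equation (2-5)
    lam = [[m[i][j] / col[j] for j in range(M)] for i in range(M)]  # equation (2-6)
    p_cc = sum(p_c[i] * lam[i][i] for i in range(M))          # equation (2-8)
    prior = [sum(p_c[j] * lam[i][j] for j in range(M)) for i in range(M)]  # (2-10)
    # err[i][j] estimates P(classifier label j | true label i), equation (2-12).
    err = [[p_c[j] * lam[i][j] / prior[i] for j in range(M)] for i in range(M)]
    return p_c, lam, p_cc, prior, err


# Hypothetical two-class counts, chosen only for illustration.
p_c, lam, p_cc, prior, err = ml_estimates([[40, 10], [5, 45]], [60, 40])
```

By construction, the estimated a priori probabilities sum to one, and each row of `err` sums to one.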





In the following, expressions are derived for the asymptotic variance
of the estimates of the probability of correct classification and proportions.
From equation (2-7), the estimate $\hat{P}_{cc}$ can be written as

$$
\hat{P}_{cc} = \sum_{i=1}^{M} \hat{P}_c(i)\,\hat{\lambda}_{ii}
\tag{2-13}
$$


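The delta method (ref. 14) is used to compute the asymptotic variance of
$\hat{P}_{cc}$. This involves expanding $\hat{P}_{cc}$ in a Taylor series around
the true value $P_{cc} = \sum_{i=1}^{M} P_c(i)\lambda_{ii}$. The result of this
expansion is

$$
\begin{aligned}
\mathrm{Var}(\hat{P}_{cc}) \approx{}
 & \sum_{i=1}^{M} \sum_{j=1}^{M} \mathrm{Cov}(\hat{\lambda}_{ii}, \hat{\lambda}_{jj})
   \frac{\partial P_{cc}}{\partial \lambda_{ii}} \frac{\partial P_{cc}}{\partial \lambda_{jj}} \\
 &+ 2 \sum_{i=1}^{M} \sum_{j=1}^{M} \mathrm{Cov}[\hat{\lambda}_{ii}, \hat{P}_c(j)]
   \frac{\partial P_{cc}}{\partial \lambda_{ii}} \frac{\partial P_{cc}}{\partial P_c(j)} \\
 &+ \sum_{i=1}^{M} \sum_{j=1}^{M} \mathrm{Cov}[\hat{P}_c(i), \hat{P}_c(j)]
   \frac{\partial P_{cc}}{\partial P_c(i)} \frac{\partial P_{cc}}{\partial P_c(j)}
\end{aligned}
\tag{2-14}
$$

The number of independent parameters is $2M - 1$; namely,
$\lambda_{11}, \lambda_{22}, \ldots, \lambda_{MM}$ and
$P_c(1), P_c(2), \ldots, P_c(M - 1)$. If these parameters are labeled by
$\theta_i$, $i = 1, 2, \ldots, 2M - 1$, the $(2M - 1)$ by $(2M - 1)$ information
matrix, the general term of which is given by
$-E[\partial^2 \log L / \partial \theta_i\, \partial \theta_j]$, is evaluated from
equation (2-2). Carrying out these calculations and inverting the resulting
matrix yields the variance-covariance matrix of $\hat{\lambda}_{ii}$,
$i = 1, 2, \ldots, M$, and $\hat{P}_c(j)$, $j = 1, 2, \ldots, M - 1$.
From this, the following are obtained.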
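$$
\mathrm{Var}[\hat{P}_c(i)] = \frac{P_c(i)[1 - P_c(i)]}{N}
\tag{2-15}
$$

$$
\mathrm{Cov}[\hat{P}_c(i), \hat{P}_c(j)] = -\frac{P_c(i) P_c(j)}{N}
\tag{2-16}
$$

$$
\mathrm{Var}(\hat{\lambda}_{ii}) = \frac{\lambda_{ii}(1 - \lambda_{ii})}{m P_c(i)}
\tag{2-17}
$$

$$
\mathrm{Cov}[\hat{\lambda}_{ii}, \hat{P}_c(j)]
= \mathrm{Cov}[\hat{P}_c(i), \hat{\lambda}_{jj}]
= \mathrm{Cov}(\hat{\lambda}_{ii}, \hat{\lambda}_{kk}) = 0
\tag{2-18}
$$

for all $i$ and $j$, $i \neq k$, where

$$
N = \sum_{j=1}^{M} x_j
\tag{2-19}
$$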


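Substituting equations (2-15) through (2-19) into equation (2-14) yields an
expression for $\mathrm{Var}(\hat{P}_{cc})$ as follows.

$$
\begin{aligned}
\mathrm{Var}(\hat{P}_{cc}) ={}
 & \sum_{i=1}^{M} \frac{\lambda_{ii}(1 - \lambda_{ii})}{m P_c(i)} P_c^2(i)
 + \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \neq i}}^{M}
   \frac{[-P_c(i) P_c(j)]}{N} \lambda_{ii} \lambda_{jj}
 + \sum_{i=1}^{M} \frac{P_c(i)[1 - P_c(i)]}{N} \lambda_{ii}^2 \\
={} & \frac{1}{m} \sum_{i=1}^{M} \lambda_{ii}(1 - \lambda_{ii}) P_c(i)
 + \frac{1}{N} \left[ \sum_{i=1}^{M} \lambda_{ii}^2 P_c(i)
 - \sum_{i=1}^{M} \sum_{j=1}^{M} \lambda_{ii} \lambda_{jj} P_c(i) P_c(j) \right]
\end{aligned}
\tag{2-20}
$$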


Following a similar analysis, an expression may be obtained for the asymptotic
variance of the a priori probability estimator (ref. 15).

$$
\mathrm{Var}(\hat{P}_i) = \sum_{j=1}^{M} \left\{
\frac{\lambda_{ij}(1 - \lambda_{ij}) P_c(j)}{m}
+ \frac{\lambda_{ij} P_c(j) \left[ \lambda_{ij} - \sum_{\ell=1}^{M} P_c(\ell) \lambda_{i\ell} \right]}{N}
\right\}
\tag{2-21}
$$

In general, one can obtain expressions for sample sizes $m$ and $N$, either by
minimizing $\mathrm{Var}(\hat{P}_{cc})$ or by minimizing $\mathrm{Var}(\hat{P}_i)$,
subject to some cost constraints.
3. MAXIMUM LIKELIHOOD ESTIMATION WITH LABEL IMPERFECTIONS


In practical situations, obtaining labels is expensive, and very often these
labels are imperfect. In this section, we formulate the problem of estimat- 
ing, with imperfections in the labels, the various quantities considered in 
section 2. 

It is assumed that the classifier is trained on representative data, and a 
set of labeled patterns (possibly with imperfect labels) and a set of 
unlabeled patterns are presented to the classifier. The classifier classi- 
fies these patterns, and the results are matrices similar to table 2-1. 

Now the various quantities are defined as follows. 

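Let $\omega'$ be the imperfect label, $P'_i = P(\omega' = i)$ be the a priori
probability that the imperfect label is $i$, and
$p'_{ij} = P(\omega' = i, \omega_c = j)$ be the probability that the imperfect
label is $i$ and the classifier label is $j$. Consider

$$
\begin{aligned}
p'_{ij} &= P(\omega' = i, \omega_c = j) \\
 &= \sum_{\ell=1}^{M} P(\omega' = i, \omega_c = j, \omega = \ell) \\
 &= \sum_{\ell=1}^{M} P(\omega' = i \mid \omega_c = j, \omega = \ell)\,P(\omega_c = j, \omega = \ell) \\
 &= \sum_{\ell=1}^{M} P(\omega' = i \mid \omega = \ell)\,P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell)
\end{aligned}
\tag{3-1}
$$

where it is assumed that

$$
P(\omega' = i \mid \omega = \ell) = P(\omega' = i \mid \omega = \ell, \omega_c = j)
\tag{3-2}
$$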

This assumption states that, given the true label and the classifier label, 
the imperfect label depends only on the true label. This is a reasonable 
assumption. In acquiring the label for a pattern, the labeler depends 
heavily on the true label of the pattern and virtually does not know the 
classifier label. (In labeling a pixel in imagery data, the assigned label
depends on the true label of the pixel and its neighbors and on some other
data such as ancillary information.) Now consider

$$
\begin{aligned}
P_c(j) &= P(\omega_c = j) \\
 &= \sum_{\ell=1}^{M} P(\omega_c = j, \omega = \ell) \\
 &= \sum_{\ell=1}^{M} P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell)
\end{aligned}
\tag{3-3}
$$

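Substituting equations (3-1) and (3-3) into the likelihood function and
taking the logarithm results in

$$
\begin{aligned}
L = \log C
 &+ \sum_{i=1}^{M} \sum_{j=1}^{M} m_{ij} \log \left[ \sum_{\ell=1}^{M}
    P(\omega' = i \mid \omega = \ell)\,P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell) \right] \\
 &+ \sum_{j=1}^{M} x_j \log \left[ \sum_{\ell=1}^{M}
    P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell) \right]
\end{aligned}
\tag{3-4}
$$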


Finding closed-form solutions for the parameters by maximizing $L$ seems to be
difficult, since the resulting equations are coupled in the parameters.
However, optimization techniques, such as the Davidon-Fletcher-Powell
procedure, can be used to maximize $L$ (refs. 16-18). Now, the problem can be
formulated as

Find: $P(\omega' = i \mid \omega = \ell)$, $P(\omega_c = j \mid \omega = \ell)$,
$P(\omega = \ell)$ ; $i, j, \ell = 1, 2, \ldots, M$
such that $L$ is maximized subject to the following constraints.

$$
\begin{aligned}
\sum_{i=1}^{M} P(\omega' = i \mid \omega = \ell) &= 1 \; ; \; \ell = 1, 2, \ldots, M \\
\sum_{j=1}^{M} P(\omega_c = j \mid \omega = \ell) &= 1 \; ; \; \ell = 1, 2, \ldots, M \\
\sum_{\ell=1}^{M} P(\omega = \ell) &= 1 \\
P(\omega' = i \mid \omega = \ell) &\geq 0 \; ; \; i, \ell = 1, 2, \ldots, M \\
P(\omega_c = j \mid \omega = \ell) &\geq 0 \; ; \; j, \ell = 1, 2, \ldots, M \\
P(\omega = \ell) &\geq 0 \; ; \; \ell = 1, 2, \ldots, M
\end{aligned}
\tag{3-5}
$$
The numbers of parameters and constraints for different values of M are listed
in table 3-1.


TABLE 3-1.- PARAMETERS AND CONSTRAINTS FOR A GENERAL CASE

Number of      Number of        Number of constraints
classes,       parameters,      Equality,      Inequality,
M              2M^2 + M         2M + 1         2M^2 + M

2              10               5              10
3              21               7              21
4              36               9              36
5              55               11             55

As indicated in table 3-1, the numbers of parameters and constraints increase 
with the square of the number of classes, resulting in a large number of 
degrees of freedom for the optimization problem. However, the numbers of
constraints and parameters can be reduced by modeling the label imperfections 
and the probabilities of misclassification. 
3.1 MAXIMUM LIKELIHOOD ESTIMATION WITH SIMPLIFIED MODELS


This section provides (1) models for label imperfections and probabilities of
misclassification and (2) a formulation of the problem of maximum likelihood
estimation. To develop a model for describing the probabilities of
imperfections in the labels, consider the following.

a. If there are no imperfections in the labels, for different $i$ and $j$,

$$
P(\omega' = i \mid \omega = i) = 1
\qquad \text{and} \qquad
P(\omega' = j \mid \omega = i) = 0
\tag{3-6}
$$

b. If the imperfect label for a pattern is assigned purely at random,
irrespective of its true label, for different $i$ and $j$,

$$
P(\omega' = i \mid \omega = i) = \frac{1}{M}
\qquad \text{and} \qquad
P(\omega' = j \mid \omega = i) = \frac{1}{M}
\tag{3-7}
$$

Since, in a practical situation, the assignment of a label lies somewhere
between the above two extremes, the imperfections in the labels can be modeled
through a parameter $\theta_1$, which lies between 0 and 1, as

$$
P(\omega' = i \mid \omega = i) = \frac{(1 - \theta_1)}{M} + \theta_1
\; ; \qquad
P(\omega' = j \mid \omega = i) = \frac{(1 - \theta_1)}{M}
\tag{3-8}
$$

where $0 \leq \theta_1 \leq 1$.


From equations (3-6) through (3-8), it is easily seen that $\theta_1 = 1$
denotes no imperfections in the labels and $\theta_1 = 0$ denotes random
labeling. The following shows that this definition satisfies the postulates of
probability.
Consider the following. 
$$
\begin{aligned}
\sum_{j=1}^{M} P(\omega' = j \mid \omega = i)
 &= P(\omega' = i \mid \omega = i) + \sum_{\substack{j=1 \\ j \neq i}}^{M} P(\omega' = j \mid \omega = i) \\
 &= \frac{(1 - \theta_1)}{M} + \theta_1 + \sum_{\substack{j=1 \\ j \neq i}}^{M} \frac{(1 - \theta_1)}{M} \\
 &= \theta_1 + 1 - \theta_1 = 1
\end{aligned}
\tag{3-9}
$$

thus satisfying the probability rule. However, it is noted that the
imperfections in the labels can be modeled through some other parameter; for
example, $\theta_1$ can be made a function of a parameter $\alpha$,
$0 < \alpha < \infty$, or of a parameter $\beta$, $-\infty < \beta < \infty$.
In this section, it is assumed that the imperfections are modeled
through equation (3-8).


Similarly, classification errors can be modeled as follows.

a. If there are no classification errors, for different $i$ and $j$,

$$
P(\omega_c = i \mid \omega = i) = 1
\qquad \text{and} \qquad
P(\omega_c = j \mid \omega = i) = 0
\tag{3-10}
$$

b. If the classifier is making random decisions, for different $i$ and $j$,

$$
P(\omega_c = i \mid \omega = i) = \frac{1}{M}
\qquad \text{and} \qquad
P(\omega_c = j \mid \omega = i) = \frac{1}{M}
\tag{3-11}
$$

Since, in general, the truth lies somewhere between the above two extremes,
the classification errors can be modeled through a parameter $\theta_2$, which
lies between 0 and 1, as

$$
P(\omega_c = i \mid \omega = i) = \frac{(1 - \theta_2)}{M} + \theta_2
\qquad \text{and} \qquad
P(\omega_c = j \mid \omega = i) = \frac{(1 - \theta_2)}{M}
\tag{3-12}
$$

where $0 \leq \theta_2 \leq 1$. As before, it can be seen that this model
satisfies the postulates of probability.
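
The one-parameter models of equations (3-8) and (3-12) can be checked numerically. The sketch below is illustrative only; the function names `mixing_matrix` and `column_sums` are assumptions and do not appear in the original text. The same construction serves for the label model (with $\theta_1$) and the classifier model (with $\theta_2$).

```python
def mixing_matrix(theta, M):
    """Return A with A[j][i] = P(assigned label j | true label i).

    Diagonal entries are (1 - theta)/M + theta and off-diagonal entries are
    (1 - theta)/M, as in equations (3-8) and (3-12).
    """
    off = (1.0 - theta) / M
    return [[off + (theta if j == i else 0.0) for i in range(M)] for j in range(M)]


def column_sums(A):
    """Sum over assigned labels for each true label; each sum should equal one."""
    M = len(A)
    return [sum(A[j][i] for j in range(M)) for i in range(M)]


# theta = 1 gives perfect labels, theta = 0 gives purely random labels.
perfect = mixing_matrix(1.0, 2)
random_labels = mixing_matrix(0.0, 4)
intermediate = mixing_matrix(0.6, 3)
```

For any `theta` between 0 and 1, `column_sums` returns ones (up to rounding), which is the check carried out analytically in equation (3-9).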


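Let $\lambda_1 = (1 - \theta_1)$ and $\lambda_2 = \theta_1$; then
$\lambda_1 + \lambda_2 = 1$. Similarly, let $\lambda_3 = (1 - \theta_2)$ and
$\lambda_4 = \theta_2$; then $\lambda_3 + \lambda_4 = 1$. The following expresses
the likelihood function in terms of the above models. Consider

$$
\begin{aligned}
P_c(j) &= P(\omega_c = j) \\
 &= \sum_{\ell=1}^{M} P(\omega = \ell)\,P(\omega_c = j \mid \omega = \ell) \\
 &= P(\omega_c = j \mid \omega = j)\,P(\omega = j)
  + \sum_{\substack{\ell=1 \\ \ell \neq j}}^{M} P(\omega = \ell)\,P(\omega_c = j \mid \omega = \ell) \\
 &= \frac{\lambda_3}{M} + \lambda_4 P_j
\end{aligned}
\tag{3-13}
$$

$$
\begin{aligned}
p'_{ii} &= P(\omega' = i, \omega_c = i) \\
 &= \sum_{\ell=1}^{M} P(\omega' = i \mid \omega = \ell)\,P(\omega_c = i \mid \omega = \ell)\,P(\omega = \ell) \\
 &= \sum_{\substack{\ell=1 \\ \ell \neq i}}^{M} P(\omega' = i \mid \omega = \ell)\,P(\omega_c = i \mid \omega = \ell)\,P(\omega = \ell)
  + P(\omega' = i \mid \omega = i)\,P(\omega_c = i \mid \omega = i)\,P(\omega = i) \\
 &= \sum_{\substack{\ell=1 \\ \ell \neq i}}^{M} \frac{\lambda_1}{M} \frac{\lambda_3}{M} P_\ell
  + \left( \frac{\lambda_1}{M} + \lambda_2 \right) \left( \frac{\lambda_3}{M} + \lambda_4 \right) P_i \\
 &= \frac{\lambda_1 \lambda_3}{M^2}
  + \left[ \frac{(\lambda_1 \lambda_4 + \lambda_2 \lambda_3)}{M} + \lambda_2 \lambda_4 \right] P_i
\end{aligned}
\tag{3-14}
$$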


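Similarly, for $i \neq j$,

$$
\begin{aligned}
p'_{ij} &= P(\omega' = i, \omega_c = j) \\
 &= \sum_{\ell=1}^{M} P(\omega' = i \mid \omega = \ell)\,P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell) \\
 &= \sum_{\substack{\ell=1 \\ \ell \neq i,j}}^{M} P(\omega' = i \mid \omega = \ell)\,P(\omega_c = j \mid \omega = \ell)\,P(\omega = \ell) \\
 &\quad + P(\omega' = i \mid \omega = i)\,P(\omega_c = j \mid \omega = i)\,P(\omega = i)
  + P(\omega' = i \mid \omega = j)\,P(\omega_c = j \mid \omega = j)\,P(\omega = j) \\
 &= \sum_{\substack{\ell=1 \\ \ell \neq i,j}}^{M} \frac{\lambda_1}{M} \frac{\lambda_3}{M} P_\ell
  + \left( \frac{\lambda_1}{M} + \lambda_2 \right) \frac{\lambda_3}{M} P_i
  + \frac{\lambda_1}{M} \left( \frac{\lambda_3}{M} + \lambda_4 \right) P_j \\
 &= \frac{\lambda_1 \lambda_3}{M^2} + \frac{\lambda_2 \lambda_3}{M} P_i + \frac{\lambda_1 \lambda_4}{M} P_j
\end{aligned}
\tag{3-15}
$$

Substituting equations (3-13) through (3-15) into the likelihood function
results in

$$
\begin{aligned}
L = \log C
 &+ \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \neq i}}^{M} m_{ij} \log \left(
    \frac{\lambda_1 \lambda_3}{M^2} + \frac{\lambda_2 \lambda_3}{M} P_i + \frac{\lambda_1 \lambda_4}{M} P_j \right) \\
 &+ \sum_{i=1}^{M} m_{ii} \log \left\{ \frac{\lambda_1 \lambda_3}{M^2}
    + \left[ \frac{(\lambda_1 \lambda_4 + \lambda_2 \lambda_3)}{M} + \lambda_2 \lambda_4 \right] P_i \right\} \\
 &+ \sum_{j=1}^{M} x_j \log \left( \frac{\lambda_3}{M} + \lambda_4 P_j \right)
\end{aligned}
\tag{3-16}
$$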
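Now, the problem can be stated as follows.

Find: $\lambda_i$ ($i = 1, 2, 3, 4$) and $P_j$ ($j = 1, 2, \ldots, M$)

so that $L$ is maximized subject to the following constraints.

$$
\begin{aligned}
\lambda_1 + \lambda_2 &= 1 \\
\lambda_3 + \lambda_4 &= 1 \\
\sum_{j=1}^{M} P_j &= 1 \\
\lambda_i &\geq 0 \; ; \; i = 1, 2, 3, 4 \\
P_j &\geq 0 \; ; \; j = 1, 2, \ldots, M
\end{aligned}
\tag{3-17}
$$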


Optimization techniques, such as the Davidon-Fletcher-Powell procedure, can 
be used to maximize L (refs. 16-18). The numbers of parameters and constraints 
for different values of M are listed in table 3-2. 

TABLE 3-2.- PARAMETERS AND CONSTRAINTS FOR A
SIMPLIFIED PROBLEM

Number of      Number of        Number of constraints
classes,       parameters,      Equality,      Inequality,
M              4 + M            3              4 + M

2              6                3              6
3              7                3              7
4              8                3              8
5              9                3              9


Table 3-2 indicates that the optimization problem is considerably simplified. 
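
The structure of the simplified problem can be illustrated with a small numerical sketch for the two-class case. The fragment below is not the Davidon-Fletcher-Powell procedure used in this paper; as a stand-in, it evaluates the log likelihood (without the constant log C) on a coarse grid of $\theta_1$, $\theta_2$, and $P_1$ and keeps the best grid point. The function names, the grid resolution, and the synthetic counts in the usage lines are all assumptions made for illustration.

```python
import math
from itertools import product


def mix(theta, M):
    """One-parameter model: (1 - theta)/M off the diagonal, plus theta on it."""
    off = (1.0 - theta) / M
    return [[off + (theta if r == c else 0.0) for c in range(M)] for r in range(M)]


def log_likelihood(m, x, theta1, theta2, prior):
    """Log likelihood of labeled counts m[i][j] and unlabeled counts x[j]."""
    M = len(prior)
    A = mix(theta1, M)  # A[i][l] = P(imperfect label i | true label l)
    B = mix(theta2, M)  # B[j][l] = P(classifier label j | true label l)
    total = 0.0
    for j in range(M):
        p_c = sum(B[j][l] * prior[l] for l in range(M))
        terms = [(x[j], p_c)]
        for i in range(M):
            p_ij = sum(A[i][l] * B[j][l] * prior[l] for l in range(M))
            terms.append((m[i][j], p_ij))
        for count, prob in terms:
            if count > 0:
                if prob <= 0.0:
                    return -math.inf  # observed an event the model forbids
                total += count * math.log(prob)
    return total


def grid_search_two_class(m, x, steps=20):
    """Crude grid search over (theta1, theta2, P1); a stand-in for a real optimizer."""
    best_value, best_point = -math.inf, None
    grid = product(range(1, steps + 1), range(1, steps + 1), range(1, steps))
    for a, b, c in grid:
        theta1, theta2, p1 = a / steps, b / steps, c / steps
        value = log_likelihood(m, x, theta1, theta2, [p1, 1.0 - p1])
        if value > best_value:
            best_value, best_point = value, (theta1, theta2, p1)
    return best_point


# Synthetic expected counts generated from known parameters (illustration only).
A, B, P = mix(0.8, 2), mix(0.7, 2), [0.3, 0.7]
m = [[1000 * sum(A[i][l] * B[j][l] * P[l] for l in range(2)) for j in range(2)]
     for i in range(2)]
x = [2000 * sum(B[j][l] * P[l] for l in range(2)) for j in range(2)]
estimate = grid_search_two_class(m, x)
```

A grid search handles the constraints trivially, because every grid point is feasible, but it scales poorly with the number of classes; a gradient-based procedure with a penalty function is the practical choice for larger M.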
3.2 A PRACTICAL APPLICATION

The maximum likelihood estimation with the simplified models presented in
section 3.1 is applied to processing remotely sensed Landsat multispectral
scanner (MSS) data. Several segments(1) are processed in the following manner.

A linear classifier is trained for two classes. Class 1 is wheat (W) and
class 2 is other (N). This classifier is used to classify a test set of data
(104 patterns) for which labels are available and a set of data (209 patterns)
for which labels are not available. Thus, the classifications corresponding
to table 2-1 are computed. The labels for the test data are assumed to
be imperfect. The maximum likelihood estimates of $\lambda_i$ ($i = 1, 2, 3, 4$)
and $P_j$ ($j = 1, 2$), subject to the constraints of equation (3-17), are
obtained using the Davidon-Fletcher-Powell optimization procedure
(refs. 16,17).

The Davidon-Fletcher-Powell procedure, in conjunction with an exterior penalty
function, very efficiently carries out the optimization of the performance
function, subject to various constraints. In general, these constraints must
be continuous differentiable functions of the parameters. The original
likelihood function is augmented with the functions of the constraints. The
augmented likelihood function is penalized whenever the constraints are
violated. For sufficiently large penalties, the unconstrained optimization of
the augmented likelihood function can be shown to be equivalent to the
original constrained optimization.

The results obtained from the optimization of the likelihood function are
shown in table 3-3. The last column in table 3-3 lists the $P(\omega = 1)$
values computed from the ground-truth information over the entire segment for
each segment. The following conclusions can be made from table 3-3. The mean
and variance of errors of estimated $P_1$ with respect to the ground-truth
$P_1$ are smaller with the modeling of imperfections in the labels than with the


(1) A segment is a 9- by 11-kilometer (5- by 6-nautical mile) area for which the
MSS image is divided into a rectangular array of pixels, 117 rows by
196 columns.
TABLE 3-3.- ESTIMATES OF A PRIORI PROBABILITY AND P_cc WITH AND WITHOUT
MODELING OF IMPERFECTIONS IN THE LABELS

                                Without modeling     With modeling imperfections          Ground-
                                imperfections in     in the labels                        truth
         Site description       the labels                                                proportion,
Segment  County     State       P_1      P_cc        P(ω'=1|ω=1)   P_cc     P_1=P(ω=1)    P(ω=1)
                                                     (a)                    (b)

1060     Sherman    Tex.        0.3421   0.8284      0.8377        0.9905   0.2492        0.229
1512     Clay       Minn.        .4295    .7653       .7678        1.0000    .3594         .337
1520     Big Stone  Minn.        .2647    .7763      1.0000         .7790    .2759         .299
1604     Renville   N. Dak.      .5506    .6378       .7100         .8363    .6030         .526
1648     Spink      S. Dak.      .2868    .8160      1.0000         .8182    .2894         .379
1677     Spink      S. Dak.      .3838    .7501       .7847         .9445    .3034         .341
1734     Hill       Mont.        .4663    .8857       .8865        1.0000    .4486         .440
1929     Blaine     Mont.        .4445    .9422      1.0000         .9472    .4672         .426

Mean of errors                  0.02391                                     0.002388
Variance of errors              0.00374                                     0.002318

(a) Probability of label imperfections.
(b) Estimated proportion of class 1.



estimates obtained assuming the labels are perfect. When there are no
imperfections in the labels (i.e., for segments 1520, 1648, and 1929), the
estimates of $P_{cc}$ obtained with and without modeling of imperfections in
the labels are nearly identical. Furthermore, when the estimated $P_{cc}$ is 1
(with modeling of label imperfections), the estimated $P_{cc}$ (assuming labels
are perfect) is nearly identical with the probability of label imperfections.
The $P_1$ and $P'_1$ are related as follows.

$$
P'_1 = P(\omega' = 1) = \sum_{\ell=1}^{M} P(\omega' = 1 \mid \omega = \ell)\,P(\omega = \ell)
\tag{3-18}
$$

If it is assumed that the labels are perfect, the estimate of $P_1$ is an
estimate of $P'_1$. Table 3-4 lists the estimate of $P'_1$ obtained from
equation (3-18) and that obtained as a maximum likelihood estimate from
equation (2-10), assuming the labels are perfect.

TABLE 3-4.- COMPARISON OF ESTIMATES OF P'_1 WITH AND WITHOUT
MODELING OF LABEL IMPERFECTIONS

           Estimate of P'_1,                      Maximum likelihood
Segment    P'_1 = sum over j of                   estimate of P_1 obtained
           P(ω'=1|ω=j) P(ω=j)                     from equation (2-10)

1060       0.3322                                 0.3421
1512        .4246                                  .4295
1520        .2759                                  .2647
1604        .5432                                  .5506
1648        .2894                                  .2868
1677        .3880                                  .3838
1734        .4602                                  .4663
1929        .4672                                  .4445


Columns 2 and 3 of table 3-4 are almost identical, thus verifying the validity
of the models used in defining the label imperfections.
3.3 MAXIMUM LIKELIHOOD ESTIMATION WITH CLASS-DEPENDENT MODELING OF LABEL
IMPERFECTIONS AND ERROR PROBABILITIES

When modeling label imperfections and error probabilities, the $\theta$'s and
hence the $\lambda$'s can be made class dependent, which increases the
complexity of the problem. For different $i$ and $j$, the imperfections in the
labels can be modeled as

$$
\begin{aligned}
P(\omega' = i \mid \omega = i) &= \frac{[1 - \theta_1(i)]}{M} + \theta_1(i) \\
P(\omega' = j \mid \omega = i) &= \frac{[1 - \theta_1(i)]}{M} \\
0 \leq \theta_1(i) &\leq 1
\end{aligned}
\tag{3-19}
$$

Similarly, for different $i$ and $j$, the error probabilities can be modeled as

$$
\begin{aligned}
P(\omega_c = i \mid \omega = i) &= \frac{[1 - \theta_2(i)]}{M} + \theta_2(i) \\
P(\omega_c = j \mid \omega = i) &= \frac{[1 - \theta_2(i)]}{M} \\
0 \leq \theta_2(i) &\leq 1
\end{aligned}
\tag{3-20}
$$

It can be shown that these models satisfy the postulates of probability.

Let $\lambda_1(i) = [1 - \theta_1(i)]$, $\lambda_2(i) = \theta_1(i)$,
$\lambda_3(i) = [1 - \theta_2(i)]$, and $\lambda_4(i) = \theta_2(i)$. Then,

$$
\begin{aligned}
\lambda_1(i) + \lambda_2(i) &= 1 \; ; \; i = 1, 2, \ldots, M \\
\lambda_3(i) + \lambda_4(i) &= 1 \; ; \; i = 1, 2, \ldots, M
\end{aligned}
\tag{3-21}
$$

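An analysis similar to equations (3-13) through (3-15) yields the following
equations.

$$
P_c(j) = P(\omega_c = j) = \sum_{\ell=1}^{M} \frac{\lambda_3(\ell)}{M} P_\ell + \lambda_4(j) P_j
\tag{3-22}
$$

$$
p'_{ii} = P(\omega' = i, \omega_c = i)
 = \sum_{\ell=1}^{M} \frac{\lambda_1(\ell)}{M} \frac{\lambda_3(\ell)}{M} P_\ell
 + \left[ \frac{\lambda_1(i)}{M} \lambda_4(i) + \lambda_2(i) \frac{\lambda_3(i)}{M}
 + \lambda_2(i) \lambda_4(i) \right] P_i
\tag{3-23}
$$

and, for $i \neq j$,

$$
p'_{ij} = P(\omega' = i, \omega_c = j)
 = \sum_{\ell=1}^{M} \frac{\lambda_1(\ell)}{M} \frac{\lambda_3(\ell)}{M} P_\ell
 + \frac{\lambda_2(i) \lambda_3(i)}{M} P_i
 + \frac{\lambda_1(j) \lambda_4(j)}{M} P_j
\tag{3-24}
$$

Equations (3-22) through (3-24) can be used to express the likelihood
function as follows.

$$
\begin{aligned}
L = \log C
 &+ \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \neq i}}^{M} m_{ij} \log \left[
    \sum_{\ell=1}^{M} \frac{\lambda_1(\ell) \lambda_3(\ell)}{M^2} P_\ell
    + \frac{\lambda_2(i) \lambda_3(i)}{M} P_i
    + \frac{\lambda_1(j) \lambda_4(j)}{M} P_j \right] \\
 &+ \sum_{i=1}^{M} m_{ii} \log \left\{
    \sum_{\ell=1}^{M} \frac{\lambda_1(\ell) \lambda_3(\ell)}{M^2} P_\ell
    + \left[ \frac{\lambda_1(i) \lambda_4(i)}{M} + \frac{\lambda_2(i) \lambda_3(i)}{M}
    + \lambda_2(i) \lambda_4(i) \right] P_i \right\} \\
 &+ \sum_{i=1}^{M} x_i \log \left[
    \sum_{\ell=1}^{M} \frac{\lambda_3(\ell)}{M} P_\ell + \lambda_4(i) P_i \right]
\end{aligned}
\tag{3-25}
$$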
The problem of maximizing $L$ may be stated as follows:

Find: $\lambda_i(j)$ ($j = 1, 2, \ldots, M$; $i = 1, 2, 3, 4$) and $P_j$
($j = 1, 2, \ldots, M$)

so that $L$ is maximized subject to the following constraints.

$$
\begin{aligned}
\sum_{i=1}^{M} P_i &= 1 \\
\lambda_1(i) + \lambda_2(i) &= 1 \; ; \; i = 1, 2, \ldots, M \\
\lambda_3(i) + \lambda_4(i) &= 1 \; ; \; i = 1, 2, \ldots, M \\
\lambda_i(j) &\geq 0 \; ; \; i = 1, 2, 3, 4 \text{ and } j = 1, 2, \ldots, M \\
P_i &\geq 0 \; ; \; i = 1, 2, \ldots, M
\end{aligned}
\tag{3-26}
$$


The optimization technique of Davidon, Fletcher, and Powell (refs. 16,17) can 
be used to maximize L in equation (3-25), subject to the constraints of 
equation (3-26). The numbers of parameters and constraints for different 
values of M are listed in table 3-5. 


TABLE 3-5.- PARAMETERS AND CONSTRAINTS FOR
CLASS-DEPENDENT MODELS

Number of      Number of        Number of constraints
classes,       parameters,      Equality,      Inequality,
M              4M + M           2M + 1         4M + M

2              10               5              10
3              15               7              15
4              20               9              20
5              25               11             25


Table 3-5 shows that the numbers of parameters and constraints grow linearly 
with M. 
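
The counts listed in tables 3-1, 3-2, and 3-5 follow directly from the formulas in their column headings. The short sketch below tabulates them side by side; the function name `problem_sizes` and the dictionary keys are illustrative assumptions.

```python
def problem_sizes(M):
    """Parameter and constraint counts for the three formulations, for M classes."""
    return {
        # General case (table 3-1): two M-by-M conditional tables plus M priors.
        "general": {"parameters": 2 * M * M + M,
                    "equality": 2 * M + 1,
                    "inequality": 2 * M * M + M},
        # Simplified models (table 3-2): four lambdas plus M priors.
        "simplified": {"parameters": 4 + M,
                       "equality": 3,
                       "inequality": 4 + M},
        # Class-dependent models (table 3-5): four lambdas per class plus M priors.
        "class_dependent": {"parameters": 4 * M + M,
                            "equality": 2 * M + 1,
                            "inequality": 4 * M + M},
    }


for M in (2, 3, 4, 5):
    sizes = problem_sizes(M)
    print(M, {name: s["parameters"] for name, s in sizes.items()})
```

For M = 2, the general and class-dependent formulations have the same number of parameters; the saving from the class-dependent models appears only for M greater than 2.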
4. IDENTIFICATION OF MISLABELED PATTERNS


This section considers the problem of identifying mislabeled patterns, if the
probability of label imperfections is either known or estimated using the
methods developed in section 3. Some relationships are developed between the
a priori probabilities and the probability densities with and without
imperfections in the labels. The imperfections in the labels are described by
the probabilities

$$
\beta_{ij} = P(\omega' = i \mid \omega = j) \; ; \; i, j = 1, 2, \ldots, M
\tag{4-1}
$$

where $i$ and $j$ indicate class. We have the constraint

$$
\sum_{i=1}^{M} \beta_{ij} = 1 \; ; \; j = 1, 2, \ldots, M
\tag{4-2}
$$


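It is assumed that

$$
p(X \mid \omega = j) = p(X \mid \omega' = i, \omega = j)
\tag{4-3}
$$

That is, given the true label of a pattern, the density of the pattern does
not depend on its imperfect label. To obtain the relationship between
$p(X \mid \omega = i)$ and $p(X \mid \omega' = i)$, consider

$$
\begin{aligned}
p(X \mid \omega' = i)
 &= \sum_{j=1}^{M} p(X, \omega = j \mid \omega' = i) \\
 &= \frac{1}{P(\omega' = i)} \sum_{j=1}^{M}
    p(X \mid \omega' = i, \omega = j)\,P(\omega' = i \mid \omega = j)\,P(\omega = j) \\
 &= \frac{1}{P(\omega' = i)} \sum_{j=1}^{M} \beta_{ij}\,P(\omega = j)\,p(X \mid \omega = j)
\end{aligned}
\tag{4-4}
$$

Similarly, the a priori probabilities are related as

$$
P(\omega' = i) = \sum_{j=1}^{M} \beta_{ij}\,P(\omega = j)
\tag{4-5}
$$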

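Inverting equation (4-4) yields the following result for the two-class case.

$$
\begin{aligned}
P(\omega = 1)\,p(X \mid \omega = 1) &= \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})}
 [\beta_{22} P(\omega' = 1)\,p(X \mid \omega' = 1) - \beta_{12} P(\omega' = 2)\,p(X \mid \omega' = 2)] \\
P(\omega = 2)\,p(X \mid \omega = 2) &= \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})}
 [\beta_{11} P(\omega' = 2)\,p(X \mid \omega' = 2) - \beta_{21} P(\omega' = 1)\,p(X \mid \omega' = 1)]
\end{aligned}
\tag{4-6}
$$

Let

$$
\beta = \begin{bmatrix}
\beta_{11} & \beta_{12} & \cdots & \beta_{1M} \\
\beta_{21} & \beta_{22} & \cdots & \beta_{2M} \\
\vdots & \vdots & & \vdots \\
\beta_{M1} & \beta_{M2} & \cdots & \beta_{MM}
\end{bmatrix}
\tag{4-7}
$$

Assuming $\beta^{-1}$ exists, the following can be obtained from equation (4-4)
in the multiclass case.

$$
P(\omega = i)\,p(X \mid \omega = i) = \sum_{s=1}^{M} [\beta^{-1}]_{is}\,P(\omega' = s)\,p(X \mid \omega' = s)
\; ; \; i = 1, 2, \ldots, M
\tag{4-8}
$$

where $[\beta^{-1}]_{is}$ denotes the $(i, s)$ element of $\beta^{-1}$.

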


4.1 IDENTIFICATION OF MISLABELED PATTERNS IN THE TWO-CLASS CASE


The following expressions are developed for the identification of mislabeled
patterns using a linear classifier. The linear classifier implements a
decision criterion

$$
\begin{aligned}
&\text{Decide } X \in \omega' = 1 \text{ if } g(X) = W^T X + w_0 > 0 \\
&\text{Decide } X \in \omega' = 2 \text{ otherwise}
\end{aligned}
\tag{4-9}
$$
It is assumed that $p(X \mid \omega' = i)$ is multivariate normal; i.e.,
$p(X \mid \omega' = i) \sim N(\mu'_i, \Sigma'_i)$, $i = 1, 2$. Since $g(X)$ is a
linear combination of the components of pattern vector $X$, if $X$ is normally
distributed, $g(X)$ is also normally distributed. That is,

$$
p[g(X) \mid X \in \omega' = i] \sim N[m'_i, (\sigma'_i)^2] \; ; \; i = 1, 2
\tag{4-10}
$$

where

$$
\begin{aligned}
m'_i &= W^T \mu'_i + w_0 \\
(\sigma'_i)^2 &= W^T \Sigma'_i W
\end{aligned}
\tag{4-11}
$$


To identify and change the labels of mislabeled patterns, the following 
scheme is proposed. 

$$
\begin{aligned}
&\text{Change the label of } X \text{ to } \omega = 1 \text{ if } g(X) > t_1 \\
&\text{Change the label of } X \text{ to } \omega = 2 \text{ if } g(X) < -t_2 \\
&\text{Do not change the label of } X \text{ if } -t_2 \leq g(X) \leq t_1
\end{aligned}
\tag{4-12}
$$

The thresholds $t_1$ and $-t_2$ are used to identify the incorrect labels and
are determined by specifying the probability $\alpha$ that mislabeling will
occur in the label correction process. An expression for the probability that
the label correction scheme will give an incorrect label is derived in the
following equation.

$$
\begin{aligned}
P_{BL} &= P(\text{bad label}) \\
 &= P(\omega = 1)\,P(\text{bad label} \mid X \in \omega = 1)
  + P(\omega = 2)\,P(\text{bad label} \mid X \in \omega = 2) \\
 &= P(\omega = 1)\,P[g(X) < -t_2 \mid X \in \omega = 1]
  + P(\omega = 2)\,P[g(X) > t_1 \mid X \in \omega = 2]
\end{aligned}
\tag{4-13}
$$
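Using equations (4-6) and (4-13) yields the following result.

$$
\begin{aligned}
P(\omega = 1)\,P[g(X) < -t_2 \mid X \in \omega = 1]
 &= P(\omega = 1) \int_{-\infty}^{-t_2} p[g(X) \mid X \in \omega = 1]\,d[g(X)] \\
 &= \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})} \left\{
    \beta_{22} P(\omega' = 1) \int_{-\infty}^{-t_2} p[g(X) \mid \omega' = 1]\,d[g(X)]
    - \beta_{12} P(\omega' = 2) \int_{-\infty}^{-t_2} p[g(X) \mid \omega' = 2]\,d[g(X)] \right\} \\
 &= \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})} \left[
    \beta_{22} P(\omega' = 1) \int_{-\infty}^{\frac{-t_2 - m'_1}{\sigma'_1}} \psi(y)\,dy
    - \beta_{12} P(\omega' = 2) \int_{-\infty}^{\frac{-t_2 - m'_2}{\sigma'_2}} \psi(y)\,dy \right]
\end{aligned}
\tag{4-14}
$$

where

$$
\psi(y) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{y^2}{2} \right)
\tag{4-15}
$$

Similarly,

$$
P(\omega = 2)\,P[g(X) > t_1 \mid X \in \omega = 2]
 = \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})} \left[
    \beta_{11} P(\omega' = 2) \int_{-\infty}^{\frac{-t_1 + m'_2}{\sigma'_2}} \psi(y)\,dy
    - \beta_{21} P(\omega' = 1) \int_{-\infty}^{\frac{-t_1 + m'_1}{\sigma'_1}} \psi(y)\,dy \right]
\tag{4-16}
$$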

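From equations (4-13) through (4-16), the probability of a bad label can
be obtained as

$$
\begin{aligned}
P_{BL} = \frac{1}{(\beta_{11}\beta_{22} - \beta_{12}\beta_{21})} \Bigg[
 & \beta_{22} P(\omega' = 1) \int_{-\infty}^{\frac{-t_2 - m'_1}{\sigma'_1}} \psi(y)\,dy
 - \beta_{12} P(\omega' = 2) \int_{-\infty}^{\frac{-t_2 - m'_2}{\sigma'_2}} \psi(y)\,dy \\
 &+ \beta_{11} P(\omega' = 2) \int_{-\infty}^{\frac{-t_1 + m'_2}{\sigma'_2}} \psi(y)\,dy
 - \beta_{21} P(\omega' = 1) \int_{-\infty}^{\frac{-t_1 + m'_1}{\sigma'_1}} \psi(y)\,dy \Bigg]
\end{aligned}
\tag{4-17}
$$

For a given $\alpha$, $t_1$ and $-t_2$ can be computed using an optimization
technique such as the Davidon-Fletcher-Powell procedure, so that the square of
the error between $\alpha$ and $P_{BL}$ is minimized, and can then be used in
the incorrect label identification scheme.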
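
The probability of a bad label can be evaluated numerically for given thresholds. The sketch below is an illustration under stated assumptions rather than the procedure used in this paper: it takes $\beta_{ij} = P(\omega' = i \mid \omega = j)$ as defined in equation (4-1), inverts the two-class relationship of equation (4-4) directly, and uses the standard normal distribution function from the Python standard library. The function name and argument names are hypothetical.

```python
from statistics import NormalDist


def bad_label_probability(beta, prior_imperfect, means, sigmas, t1, t2):
    """Probability that the label correction scheme assigns a wrong label.

    beta[i][j]         : P(imperfect label i+1 | true label j+1), 2-by-2
    prior_imperfect[i] : P(imperfect label i+1)
    means[i], sigmas[i]: mean and standard deviation of g(X) given imperfect label i+1
    t1, t2             : the upper threshold is t1 and the lower threshold is -t2
    """
    phi = NormalDist().cdf
    det = beta[0][0] * beta[1][1] - beta[0][1] * beta[1][0]
    # Weighted mass of g(X) below -t2 for each imperfect class.
    low = [prior_imperfect[i] * phi((-t2 - means[i]) / sigmas[i]) for i in range(2)]
    # Weighted mass of g(X) above t1 for each imperfect class.
    high = [prior_imperfect[i] * phi((means[i] - t1) / sigmas[i]) for i in range(2)]
    # True class 1 patterns pushed to label 2, and true class 2 patterns pushed to label 1.
    from_class_1 = (beta[1][1] * low[0] - beta[0][1] * low[1]) / det
    from_class_2 = (beta[0][0] * high[1] - beta[1][0] * high[0]) / det
    return from_class_1 + from_class_2


# With perfect labels (beta equal to the identity) the result reduces to the
# usual two-tail error of the thresholded discriminant; values are illustrative.
identity = [[1.0, 0.0], [0.0, 1.0]]
p_bl = bad_label_probability(identity, [0.5, 0.5], [1.0, -1.0], [1.0, 1.0], 0.0, 0.0)
```

Choosing thresholds for a target probability then amounts to minimizing the squared difference between that target and the value returned by this function, with whatever optimizer is at hand.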

4.2 AN EXAMPLE OF APPLICATION OF THE INCORRECT LABEL IDENTIFICATION SCHEME 

The two-class imperfect label correction scheme presented in section 4.1 is
applied to a practical problem in remote sensing, namely Landsat imagery of
segment 1060. Data from two acquisitions are processed, and each acquisition
has four spectral bands. The image is overlaid with a rectangular grid of
209 grid intersections, and the labels of the pixels corresponding to each
grid intersection are acquired. A linear classifier is trained on one-half of
the data, and the remaining half is used as a test data set. Test data set and
total data set classifications are obtained using the linear classifier. This
results in matrices corresponding to table 2-1 (a) and (b). The maximum
likelihood estimates of label imperfections are obtained using the simplified
models presented in section 3.1. The $\beta$-matrix and the a priori
probabilities obtained are


$$
\beta = \begin{bmatrix} 0.8378 & 0.1622 \\ 0.1622 & 0.8378 \end{bmatrix},
\qquad
\begin{aligned}
P(\omega = 1) &= 0.24921 \\
P(\omega = 2) &= 0.75079
\end{aligned}
\tag{4-18}
$$


With $\alpha = 0.001$ chosen, the upper and lower thresholds $t_1$ and $-t_2$
that minimize the square of the difference between $\alpha$ and $P_{BL}$ are
computed using the Davidon-Fletcher-Powell procedure. The patterns of class
$\omega' = 2$ whose discriminant function values exceed $t_1$, and the
patterns of class $\omega' = 1$ whose discriminant function values are less
than $-t_2$, are identified and marked with circles in figures 4-1 and 4-2.
These figures list the labels of the pixels at the 209 grid intersections and
their relative positions.
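
The flagging step itself is a pair of comparisons. The following Python
sketch shows it under stated assumptions: the function name and the list-based
inputs are hypothetical, the thresholds are the values reported with
figures 4-1 and 4-2, and the discriminant values are made up for illustration.

```python
def flag_two_class(g_values, labels, t1, t2):
    """Apply the two-threshold rule to recorded labels (1 or 2).

    Returns two index lists: patterns labeled 2 whose discriminant value
    exceeds t1 (candidates for class 1), and patterns labeled 1 whose
    value is below -t2 (candidates for class 2).
    """
    to_class_1 = [k for k, (g, lab) in enumerate(zip(g_values, labels))
                  if lab == 2 and g > t1]
    to_class_2 = [k for k, (g, lab) in enumerate(zip(g_values, labels))
                  if lab == 1 and g < -t2]
    return to_class_1, to_class_2


# Thresholds reported with figures 4-1 and 4-2; the g values are made up.
T1, T2 = 0.1507, 0.01628
g_values = [0.20, 0.10, -0.02, -0.01, 0.50]
labels = [2, 2, 1, 1, 1]
print(flag_two_class(g_values, labels, T1, T2))  # prints ([0], [2])
```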


Films of the two acquisitions of segment 1060 used in the processing were
examined by an analyst-interpreter (AI), and the results are given in
figures 4-3 and 4-4.


From an analysis of figures 4-3 and 4-4, it can be concluded that the
decisions of the label correction scheme are in close agreement with the
AI interpretations of the imagery films.

4.3 IDENTIFICATION OF MISLABELED PATTERNS IN THE MULTICLASS CASE
Let $g_i(X)$ be the discriminant function of the $i$th class $\omega' = i$, where

$$
g_i(X) = W_i^T X + w_{i0}; \quad i = 1, 2, \ldots, M
\tag{4-19}
$$

The usual decision criterion in a multiclass case is to decide
$X \in \omega' = l$ if

$$
g_l(X) = \max_j g_j(X)
\tag{4-20}
$$
[Figure 4-1: grid diagram not reproduced.]

Computed upper threshold $t_1$ = 0.1507

Legend: wheat pixels; other pixels; pixels identified by label correction
scheme as wheat; AI decision as wheat but bordering class other; AI decision
as other.

Figure 4-1.- Diagram of 209 grid intersections showing pixels labeled other
and other pixels reidentified as wheat using imperfect label identification
scheme.


[Figure 4-2: grid diagram not reproduced.]

Computed lower threshold $-t_2$ = -0.01628

Legend: blank = other pixels; W = wheat pixels; circled W = pixels identified
by label correction scheme as other; B = AI decision as other but bordering
wheat; * = AI decision as wheat.

Figure 4-2.- Diagram of 209 grid intersections showing pixels labeled wheat
and wheat pixels identified as other using imperfect label identification
scheme.
Figure 4-3.- AI labels for patterns where labels were changed from wheat to
other.

Figure 4-4.- AI labels for patterns where labels were changed from other to
wheat.
To identify and change the labels of mislabeled patterns, the following
scheme is proposed:

Change the label of X from $\omega' = i$ to $\omega = l$ if

$$
g_l(X) = \max_{\substack{j = 1, 2, \ldots, M \\ j \ne i}} g_j(X) > g_i(X) + t_i
\tag{4-21}
$$

where $t_i$ is a positive number.

Otherwise, do not change the label of X.
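
The rule of equation (4-21) can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the report's implementation: the
function name, the 1-based integer labels, and the list inputs are all
hypothetical.

```python
def relabel_multiclass(g, label, thresholds):
    """Apply the relabeling rule of equation (4-21) to one pattern.

    g          -- discriminant values [g_1(X), ..., g_M(X)]
    label      -- recorded label i, an integer in 1..M
    thresholds -- positive thresholds [t_1, ..., t_M]
    Returns the new label l if the rule fires, otherwise the old label.
    """
    i = label - 1
    # Best competing class: the largest g_j(X) over all j != i.
    best = max((j for j in range(len(g)) if j != i), key=lambda j: g[j])
    if g[best] > g[i] + thresholds[i]:
        return best + 1
    return label


g = [0.2, 1.0, 0.5]
print(relabel_multiclass(g, 1, [0.5, 0.5, 0.5]))  # prints 2
print(relabel_multiclass(g, 3, [0.5, 0.5, 0.5]))  # prints 3
```

In the second call the best competitor only ties $g_3(X) + t_3$, so the strict
inequality of (4-21) leaves the label unchanged.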

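The threshold $t_i$ for identifying the incorrect labels is determined by
specifying the probability $\alpha$ that mislabeling will occur in the label
correction process of equation (4-21). An upper bound on the probability that
such a scheme gives an incorrect label is derived as follows.

$$
\begin{aligned}
P_{BL} &= P(\omega = 1)\,P\Big[g_l(X) = \max_{\substack{j = 1, 2, \ldots, M \\ j \ne 1}} g_j(X) > g_1(X) + t_1 \,\Big|\, \omega = 1\Big] \\
&\quad + P(\omega = 2)\,P\Big[g_l(X) = \max_{\substack{j = 1, 2, \ldots, M \\ j \ne 2}} g_j(X) > g_2(X) + t_2 \,\Big|\, \omega = 2\Big] + \cdots \\
&\quad + P(\omega = M)\,P\Big[g_l(X) = \max_{\substack{j = 1, 2, \ldots, M \\ j \ne M}} g_j(X) > g_M(X) + t_M \,\Big|\, \omega = M\Big] \\
&= \sum_{i=1}^{M} P(\omega = i)\,P\Big[g_l(X) = \max_{\substack{j = 1, 2, \ldots, M \\ j \ne i}} g_j(X) > g_i(X) + t_i \,\Big|\, \omega = i\Big] \\
&\le \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} P(\omega = i)\,P[g_j(X) > g_i(X) + t_i \mid \omega = i]
\end{aligned}
\tag{4-22}
$$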
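It is assumed that the densities $p(X \mid \omega' = i)$ are multivariate
normal. That is, $p(X \mid \omega' = i) \sim N(M'_i, \Sigma'_i)$,
$i = 1, 2, \ldots, M$.

$$
\begin{aligned}
g_j(X) - g_i(X) &= W_j^T X + w_{j0} - W_i^T X - w_{i0} \\
&= W_{ji}^T X + w_{ji0} = g_{ji}(X)
\end{aligned}
$$

Since $g_{ji}(X)$ is a linear combination of the components of pattern vector X,
if X is normally distributed, $g_{ji}(X)$ is also normally distributed. That is,

$$
p[g_{ji}(X) \mid \omega' = s] \sim N(m'_{jis}, \sigma_{jis}^2)
$$

where

$$
m'_{jis} = W_{ji}^T M'_s + w_{ji0}, \qquad \sigma_{jis}^2 = W_{ji}^T \Sigma'_s W_{ji}
\tag{4-25}
$$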
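
The linearity argument can be made concrete with a short Python sketch. It
assumes that the difference of two linear discriminants has weight vector
$W_j - W_i$ and offset $w_{j0} - w_{i0}$, and that the class in question has
a given mean vector and covariance matrix. All names are hypothetical, and
plain lists are used in place of a matrix library.

```python
import math


def normal_cdf(z):
    """Integral of the standard normal density from -inf to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def difference_stats(w_j, w_j0, w_i, w_i0, mean_s, cov_s):
    """Mean and variance of g_j(X) - g_i(X) when X ~ N(mean_s, cov_s)."""
    w = [a - b for a, b in zip(w_j, w_i)]   # weight vector of the difference
    w0 = w_j0 - w_i0                        # offset of the difference
    mean = sum(wk * mk for wk, mk in zip(w, mean_s)) + w0
    n = len(w)
    # Quadratic form w^T cov_s w.
    var = sum(w[a] * cov_s[a][b] * w[b] for a in range(n) for b in range(n))
    return mean, var


def exceed_probability(t_i, mean, var):
    """P[g_j(X) > g_i(X) + t_i] for a normal difference N(mean, var)."""
    return normal_cdf((mean - t_i) / math.sqrt(var))


mean, var = difference_stats([1.0, 0.0], 0.0, [0.0, 1.0], 0.0,
                             [1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]])
print(mean, var)  # prints -1.0 2.0
```

Each term of the upper bound on $P_{BL}$ is a tail probability of this kind,
evaluated for one triple of classes.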



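From equations (4-8), (4-22), and (4-25), the following is obtained.

$$
\begin{aligned}
P_{BL} &\le \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} \sum_{s=1}^{M}
  \delta_{is}\, P(\omega' = s)\, P[g_j(X) > g_i(X) + t_i \mid \omega' = s] \\
&= \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} \sum_{s=1}^{M}
  \delta_{is}\, P(\omega' = s) \int_{t_i}^{\infty} p[g_{ji}(X) \mid \omega' = s]\, d[g_{ji}(X)] \\
&= \sum_{i=1}^{M} \sum_{\substack{j=1 \\ j \ne i}}^{M} \sum_{s=1}^{M}
  \delta_{is}\, P(\omega' = s) \int_{-\infty}^{\frac{-t_i + m'_{jis}}{\sigma_{jis}}} \psi(y)\, dy
\end{aligned}
\tag{4-26}
$$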

where $\psi(y)$ is given by equation (4-15). The thresholds $t_i$
($i = 1, 2, \ldots, M$) can be determined using an optimization technique such
as the Davidon-Fletcher-Powell procedure. Note that when M = 2, equations
(4-17) and (4-26) are identical. The thresholds are pictorially illustrated in
figure 4-5.

Figure 4-5 shows that the imperfect label identification scheme in the
multiclass case amounts to establishing a region around each decision surface.
5. CONCLUSIONS


In the practical applications of pattern recognition, obtaining labels for 
the patterns is expensive and very often these labels are imperfect. This 
paper has presented the problem of estimating imperfections in the labels 
and the use of these estimates in the identification of mislabeled patterns. 

It is assumed that a set of labeled patterns, the labels of which might be
imperfect, and a set of unlabeled patterns are available. The classifier
classifies these patterns, and the results are a confusion matrix for the
labeled pattern set and classification counts for the unlabeled set.

Expressions are presented for the maximum likelihood estimates of classifica- 
tion errors, for percentages of correct classification and proportions, and 
for the asymptotic variances of probability of correct classification and 
proportions. 

Assuming imperfections in the labels, simple models are presented for
modeling imperfections in the labels and classification errors. The problem of
maximum likelihood estimation of various quantities is formulated for a general
case, in terms of simplified models and class-dependent models, and their
relative complexities are discussed. Results of practical applications of
maximum likelihood estimation of various quantities are presented.

Assuming the densities are Gaussian and the probabilities of label
imperfections are known, thresholding schemes are proposed for the
identification of mislabeled patterns for both the two-class and the
multiclass cases. The probability that such an identification scheme results
in a wrong decision for a pattern is expressed as a function of the
thresholds, and the thresholds can be computed by specifying the probability
of a wrong decision by the imperfect label identification scheme.

Furthermore, the results of applying these techniques to the processing of
remotely sensed multispectral data are presented.